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ABSTRACT 

Summary: In anticipation of the individualized proteomics era and the 
need to integrate knowledge from disease studies, we have augmen- 
ted our peptide identification software RAId.DbS to take into account 
annotated single amino acid polymorphisms, post-translational modi- 
fications, and their documented disease associations while analyzing 
a tandem mass spectrum. To facilitate new discoveries, RAId.DbS 
allows users to conduct searches permitting novel polymorphisms. 
Availability: The webserver link is http://www.ncbi.nlm.nih.gov/ 
/CBBResearch/qmbp/raid.dbs/index.html. The relevant databases 
and binaries of RAId.DbS for Linux, Windows, and Mac OS X are 
available from the same web page. 
Contact: yyu@ncbi.nlm.nih.gov 

Introduction 

Like single nucleotide polymorphisms (SNPs) that occur roughly 
every 300 base pairs (Collins et al, 1998), single amino acid 
polymorphisms (SAPs) also differentiate individuals from one ano- 
ther. In addition to from nonsynonymous SNPs, SAPs may result 
from post-transcriptional regulations such as mRNA editing. SAPs 
together with post-translational modifications (PTMs) often distin- 
guish healthy/diseased forms of proteins. Integration of this annota- 
ted, disease-related knowledge with data analysis facilitates speedy, 
dynamic information retrieval that may significantly benefit clinical 
laboratory studies. 

To incorporate knowlege information within peptide searches, we 
start by constructing a human protein database where information 
about annotated SNPs, SAPs, PTMs, and their disease associations 
(if any) are integrated. We have also modified our peptide identifi- 
cation software RAId_DbS (Alves et al, 2007a) to take into account 
this additional information while performing peptide searches. Con- 
sequently, part of our work may be considered an improvement 
over that of Schandorff et al. (2007) who extended the human pro- 
tein database to include only SAPs but without PTMs and without 
integration of disease information. 

Besides using our web server, a user may also download a stan- 
dalone executable to be installed on her/his local machines. Once a 
user chooses to do so, she/he will find an important feature of the 
standalone version: the flexibility for users to add their own SAP 
and/or PTM information to various proteins they are interested in 
and even to add new protein sequences to the database. 

Implementation Summary 

In addition to giving a brief introduction to our software RAIdJDbS 
and its augmentation, we focus in this section on explaining how 
we accommodate the SAPs, PTMs, and their disease associations 

*to whom correspondence should be addressed: yyu@ncbi.nlm.nih.gov 



in our database. Appropriate comparison to existing approaches 
will also be discussed. Prior to database construction, we per- 
form a information-preserved protein clustering (see supplementary 
information). 

Database Construction To minimize inclusion of less confident 
annotations, we only keep the SAPs and PTMs that are consistently 
documented in more than one source. For example, for proteins with 
Swiss-Prot accession number, we only keep the SAPs and PTMs 
that are annotated both by Swiss-Prot and GeneBank. For proteins 
without Swiss-Prot accession numbers, the retentions of SAPs and 
PTMs are described in the supplementary information. 

A typical sequence in our augmented human protein database 
carries with it annotated SAPs and PTMs in a simple format, see 
Figure 1 and its caption. Our data format minimizes redundancy. For 
example, if a single site contains two SAPs, construction method 
proposed by Schandorff et al. (2007) will demand two almost iden- 
tical partial sequences, each may be several tens of amino acids in 
length, be appended after the primary sequence, while in our case it 
only takes up a few additional bytes. The compactness of our data- 
base becomes obvious when incorporating the information of two 
nearby sites, each containing several annotated SAPs and PTMs, 
into the database. In our construction, we only need a few additio- 
nal bytes. But in other approaches, it may introduce a combinatorial 
expansion due to including/excluding and pairing of different varia- 
tions at both sites along with the flanking peptides. Another key 
difference between our method and other database methods is that 
we do not need to limit the number of enzymatic miscleavages. 

When needed, users of RAId_DbS may modify the database, add 
new sequences, or even create their own databases following the 
same format. There is a separate information file that contains the 
protein accession numbers, detailed SAP and PTM information, 
and disease associations. If one wishes to add additional SAPs or 
PTMs, one simply updates both the ASCII database file as well as 
the information file. When reporting a hit with annotated SAPs or 
PTMs, RAIdJDbS automatically reports the corresponding detailed 
information and disease association if it exists. 

RAld_DbS Augmentation Taking into account the finite sample 
effect and skewness, the asymptotic score statistics (P-values) of 
RAIdJDbS (Alves et al,, 2007a) is derived theoretically. The final 
i5-value for each peptide hit, however, is obtained by multiplying 
the peptide's P-value by the number of peptides of its category. 
RAIdJDbS then ranks peptide hits according to i5-values, not P- 
values. This avoids overstating the significance of a hit from a 
larger effective database and is particularly helpful in reducing false 
positives when we allow SAPs and PTMs in the searches (see the 
supplementary information for details). 
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Table 1. Example search results of augmented RAIcLDbS. 



(a) P-value P-value Peptide Mol. Wt. Protein ID Novel SAP Disease 

1.184e-01 1.744e-05 RTKLKDC. . . KIAR 2897.500 (NP.l 14412;. .. ;Q9H2L5) disabled 



4.084e+00 9.345e-03 KQQELAA. . . VSSR 2898.520 (NP.072096;. . . ;O75420) disabled 



(b) iS-value P-value Peptide Mol. Wt. Protein ID Novel SAP Disease 

3.977e-07 1.834e-10 KsVEEYANCHLAR 1448.650 (NP.001054;. . . ;P02787) disabled 
4.779e-01 2.205e-04 KsVqEYANCHLAR 1447.670 (NP.001054;. .. ;P02787) disabled 

7.524e-01 3.470e-04 RflVINAsMVWAQAAR 1448.720 (NP.000337;. . . ;P48436) disabled {(£;2;108;Campomelic dysplasia (CMD1) [MIM:1 14290]) 

(s;6;112;Campomelic dysplasia (CMD1) [MIM:1 14290]) } 

RAId_DbS by default perform searches considering the parent ion to have charge +1, +2, +3. The search results are then pooled together to form a single result ranked by 
P-values. This is why for the same spectrum RAId _DbS may report peptide hits with very different masses. In part (a) ((b)), searches were done with annotated SAPs and PTMs 
turned off (on). The lowercase letters in the peptide indicate SAPs. A novel SAP, if present and enabled in the searches, will be specified in the column headed by Novel SAP. 
Note that in the disease related annotation, there are four fields separated by three semicolons. The first field, a lower case amino acid letter, indicates the SAP; second field, an 
integer, indexes the SAP position in the peptide; the third field, an integer, indexes the SAP position in the protein; the fourth field shows the annotated disease association. 



In addition to performing database searches that consider annota- 
ted SAPs and PTMs, we also allow users to include for consideration 
one novel SAP per peptide. This feature, helpful for scrutinizing 
otherwise unidentifiable spectra, will increase the effective peptide 
database size. However, it does not cause harm since the iJ-values 
associated with peptide hits containing novel SAPs are obtained by 
multiplying their P-values with a much larger number than that for 
peptide hits without novel SAPs (see supplementary information for 
details). 
Example 

Using a tandem mass (MS 2 ) spectrum taken from the profile data- 
set described earlier (Alves et ah, 2007b), we illustrate in Table 1 
two search results in the human protein database with the annota- 
ted SAPs and PTMs turned off (a) and on (b) respectively. In case 
(a), the best hit is a false positive with i5-value about 0.11 imply- 
ing that one probably ends up declaring no significant peptide hit 
for this spectrum. In case (b), however, the best hit is a true posi- 
tive (a peptide from human transferin with an annotated SAP) with 
Z?-value about 4.0 x 10~ 7 . This example shows that if properly 
used, allowing S APs/PTMs may increase peptide identification rate. 
That is, it may be fruitful to turn on the SAPs/PTMs when a regu- 
lar search returns no significant hit. Blindly turn on SAPs/PTMs, 
however, may cause loss of sensitivity due to the increase of search 
space. In supplementary information, using the 54 training spectra 
of PEAKS, we compare RAIdJDbS's peptide identifications with 
and without SAPs/PTMs. The purpose is to study the degree of loss 
in senstivity when turning on the SAPs/PTMs. Although for the 
data set tested there is no obvious loss in sensitivity (perhaps due 

MLLATLLLLLLGGALAHPDRIIFPNHACEDPPAVLLEVQGTLQRPLVR({WOO})D 
SRTSPAN((N08,N09,N10,N11,N12))CTWLILGSKEQTVTIRFQKLHLACGSERL 
TLRSPLQPLISLCEAPPSPLQLPGGN((N08,N09,N10,N11,N12))VTITYSYAGA 
RAPMGQGFLLSYSQDWLM({VOO})CLQEEFQCLNHRCVSAVQR [ 

Fig. 1. Protein sequence (NP.054764) used as an example to demonstrate 
our database structure. A "[" character is always inserted after the last amino 
acid of each protein to serve as a separator. Annotated SAPs and PTMs asso- 
ciated with an amino acid are included in a pair of angular brackets following 
that amino acid. SAPs are further enclosed by a pair of curly brackets while 
PTMs are further enclosed by a pair of round brackets. Amino acid follo- 
wed by two zeros indicates an annotated SAP. Every annotated PTM has a 
two-digit positive integer that is used to distinguish different modifications. 



to the statistical accuracy of RAId_DbS), we recommend running 
searches with SAPs/PTMs on only when a regular search returns no 
significant hit. 

Conclusion 

To enable speedy information retrieval and to enhance the protein 
coverage while analyzing MS 2 peptide spectra, we have augmented 
the capability of RAId_DbS and integrated with protein database 
additional information such as SAPs, PTMs, and disease annotati- 
ons. Incorporation of known SAPs and PTMs during initial searches 
may enhance the peptide identification rate. Integration of disease 
knowledge and information may be crucial in many time-pressed 
clinical uses. 

We are currently investigating the possibility of combining 
various isoforms of proteins into a single entry in addition to cluste- 
ring almost identical proteins. We are experimenting with keeping 
the longest form of the protein and marking at the beginning of the 
sequence possible deletions. Once achieved, this enhancement will 
further reduce redundant searches which should result in a shorter 
run time. Another objective is to cover more organisms. Currently, 
we have finished database construction of 17 organisms, including 
Homo sapiens, Drosophila melanogaster, Saccharomyces cerevisiae 
etc. (see supplementary information for details). We will provide 
more organismal databases on our web server once constructed. 
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SUPPLEMENTARY INFORMATION 

Information-preserved Protein Clustering and Database 
Construction 

We extract 34, 197 human protein sequences with a total of 
16,814,674 amino acids from the file (last updated 09/05/2006) 
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/protein/protein.gbk.gz. 
Each protein sequence is accompanied by a list of annotated SAPs 
and PTMs. Out of the 34, 197 proteins, we found 29, 979 uni- 
que proteins with a total of 15,324,913 amino acids. To avoid 
having multiple copies of identical or almost identical proteins in 
the database, we first cluster the 34, 197 sequences by running 
an all-against-all BLAST. Two sequences with identical lengths 
and aligned gaplessly with less than 2% mismatches are clustered 
together, and each sequence is called a qualified hit of the other. 
Any other sequence that satisfies this condition with a member of 
an existing cluster is assigned to that existing cluster. All the anno- 
tations in the same cluster are then merged. We find it possible for 
every given cluster to choose a consensus sequence that will make 
all other members its polymorphous forms. Hence, we only retain 
one protein sequence for each of the 29, 272 clusters. The total num- 
ber of amino acids associated with these 27, 272 consensus proteins 
is 15,001,326. 

Although we only retain one sequence (the consensus sequence) 
per cluster, the information of other member sequences are still kept. 
For example, when a member sequence and the consensus sequence 
disagree at two sites, the presence of the member sequence is docu- 
mented by introducing two cluster-induced SAPs at the two sites of 
the consensus sequence. The originally annotated SAPs and PTMs 
of the member sequence are also merged into those of the consensus 
sequence. Figure 2 illustrates how this is process is done iteratively. 
In our information file, each SAP or PTM is documented with its 
origin. SAPs arising from clustering are easily distinguished from 
annotated SAPs. For member sequences that are identical to the con- 
sensus sequence, the accession number of those member sequences 
are also recorded with their SAPs/PTMs annotations merged into 
the consensus sequence. When a user selects not to have annotated 
SAPs, RAIdJDbS still allows for cluster-induced SAPs resulting in 
an effective search of the original databases but with minimum red- 
undancy. The strategy employed by RAIdJDbS to search for SAPs 
and PTMs will be briefly described in the RAId_DbS section below. 

The consensus protein in a given cluster is then used as a 
query to BLAST against the NCBFs nr database to retrieve 
its RefSeq accession number and its corresponding Swiss-Prot 
(http://ca.expasy.org/sprot/) accession number, if it exists, from the 
best qualified hit. It is possible for a cluster to have more than one 
accession number. This happens when there is a tie in the qualified 
best hits and when a protein sequence in nr actually is documented 
with more than one accession number. 

To minimize inclusion of less confident annotations, we only 
keep the SAPs and PTMs that are consistently documented in 
more than one source. For example, for proteins with Swiss- 
Prot accession number, we only keep the SAPs and PTMs 
that are annotated both by Swiss-Prot and GeneBank. For pro- 
teins without Swiss-Prot accession numbers, the retentions of 
SAPs and PTMs are described below. The PTM annotations 
are kept only if they are present in the gzipped document 



consensus seq. ...DPR LQRLVADN((N08))GSE ... 

memberseq. . . . DPR {{WOO}). . . LKRLVVDN((N1 1 ))GSE . . . 
updated consensus seq. 

. . . DPR({W00}). . . LQ({K00})RLVA({V00})DN((N08,N1 1))GSE. . . 

Fig. 2. Information-preserved protein clustering example. Once a consen- 
sus sequence is selected, members of the clusters are merged into the 
consensus one-by-one. This figure illustrates how the information of a mem- 
ber sequence is merged into the consensus sequence. The difference in 
the primary sequences between a member and the consensus introduces 
cluster-induced SAPs. In this example, the residues Q and A (in red) in 
the consensus are different from the residues K and V (in blue) in the 
member sequence. As a consequence, K becomes a cluster-induced SAP 
associated with Q and V becomes a cluster-induced SAP associated with 
A at these respective sites of the consensus. The annotated SAP, {WOO}, 
associated with residue R in the member sequence is merged into the con- 
sensus sequence, see the updated consensus sequence in the figure. Note 
that the annotated PTM, ((Nil)), associated with N in the member sequence 
is merged with a different annotated PTM, ((N08)), at the same site of 
the consensus sequence. As mentioned earlier, although the SAPs, PTMs 
are merged, each annotation's origin and disease associations are kept in 
the information file allowing for faithful information retrieval at the final 
reporting stage of the RAId_DbS program. 



HPRD_FLAT_FILES_090107.tar.gz of the Human Protein Refe- 
rence Database: http://www.hprd.org/download. The SAP annota- 
tions are kept only if they are in agreement with the master table, 
SNP_mRNA_pos.bcp.gz (last updated 01/10/2007), of dbSNP: 
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/ 
/database/organism_data. 

RAId_DbS 

Taking into account the finite sample effect and skewness, the 
form of asymptotic score statistics (P-values) of RAId_DbS (Alves 
et al, 2007a) is derived theoretically. Since the skewness varies 
per spectrum, the parameters for our theoretical distribution are 
spectrum-specific. For each spectrum considered, our theoretical 
distribution (used to compute P-value) mostly agrees well with 
the score histogram accumulated. The final P-value for each pep- 
tide hit, however, is obtained by multiplying the peptide's P-value 
by the number of peptides of its category. As a specific example, 
when Trypsin is used as the digesting enzyme, RAId_DbS allows 
for incorrect N-terminal cleavages. RAId_DbS has internal coun- 
ters, C c and dnc, counting respectively the number of scored 
peptides with correct and incorrect N-terminal cleavage. In general, 
Cine 3> C c - When calculating the P-value of a peptide with correct 
N-terminal cleavage, RAId_DbS multiplies the peptide's P-value by 
C c - However, the P-value of a peptide with incorrect N-terminal 
cleavage will be obtained by multiplying the peptide's P-value by 
C c + Cine (Alves et al., 2007a). In line with the Bonferroni correc- 
tion, our approach avoids overstating the significance of a hit from a 
larger effective database (the pool of peptides regardless of whether 
the N-terminal cleavage is correct) versus a hit from a smaller effec- 
tive database (the pool of peptides with correct N-terminal cleavage 
only). 

The same idea is used in the augmented RAId_DbS. That is, dif- 
ferent counters are set up to record the number of scored peptides 
in different categories. As a specific example, when novel SAPs are 
allowed, RAId_DbS creates a new counter, C nove i_ Bap , to record the 
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number of scored peptides with a novel SAP. This is in general a 
much larger number than other counters. When one calculates the 
P-value associated with a peptide hit that contains a novel SAP, one 
will multiply the peptide's P-value by the sum of a number of cout- 
ners with Cnovei.sap included. However, in the same search, for a 
peptide without novel SAP, its P-value is obtained by multiplying 
the peptide's P-value by the sum of a number of counters exclu- 
ding Cnoveisap- The same approach is applied to PTMs and other 
annotations. 

Below we briefly sketch how RAIdJDbS deals with the presence 
of annotated SAPs, PTMs as well as novel SAPs. In our data- 
base format, annotated SAPs and PTMs are inserted right after the 
site of variation. When searching the database for peptides with 
parent ion mass 1500 Da, RAId_DbS sums the masses of amino 
acids within each possible peptide to see if the total mass is within 
3 Da of 1500 Da. At this stage, sites with variations will have, 
instead of a fixed mass, several possible masses depending on the 
number of SAPs/PTMs are annotated at these sites. Each peptide 
fragmnent covering some of those sites will therefore have several 
effective masses, each corresponding to a specific arrangement of 
SAPs/PTMs. If some of these masses happens to be within 3 Da 
of 1500 Da, RAIdJDbS will score this peptide with corresponding 
annotated SAPs/PTMs that give rise to the proper masses. If none 
of these masses are within the allowed molecular mass range, that 
peptide will not be scored. Note that this approach is computatio- 
nally efficient in terms of mass selection. For example, if a peptide 
contains a site with annotated SAPs/PTMs, one computes the mass 
of this peptide by summing once the amino acid masses of other 
sites. It is then a simple matter to see whether the addition of this 
sum to the list of masses associated with the site with SAPs/PTMs 
may fall in the desriable mass range. This approach is particularly 
powerful when there are more than one site with SAPs/PTMs in 
the peptide considered. The combinatorics associated with two sites 
with SAPs/PTMs only result in a longer list of possible masses to be 
added to the mass sum of unvaried sites. This should be constrasted 
with methods that incorporate SAPs via appending polymorphous 
peptides to the end of the primary sequence. In the latter approach, 
the program needs to do the mass sum multiple times, repeating the 
mass sum of unvaried sites, and thus may significantly slow down 
the searches. 

Despite RAId_DbS's strategic advantage, introduction of 
SAPs/PTMs does increase the complexity of the algorithm. The- 
refore, we limit per peptide the maximum number of annotated 
SAPs to be 2 and the maximum number of annotated PTMs to 
be 5. To facilitate discovery, RAIdJDbS also permits novel SAPs, 
but limited to one novel SAP per not-yet-annotated peptide, mea- 
ning peptides that do not contain any annotated SAPs/PTMs. This 
is because the introduction of novel SAP largely expand the search 
space, and if one allows novel SAPs on peptides already documen- 
ted with SAPs/PTMs, the search space expansion will be even larger 
and may render the search intractable. Currently, the novel SAP is 
expedited via a pre-computed list of amino acid mass difference. As 
an example, assume that one is searching for a peptide with parent 
ion mass 1500 Da, and a not-yet-annotated candidate peptide has 
mass 1477 Da, 23 Da smaller than the target mass. It happens that 
23 Da is also the mass difference between Tryptophan and Tyro- 
sine, and if the candidate peptide contains a Tyrosine, RAIdJDbS 
will replace that Tyrosine with a Tryptophan and score the new pep- 
tide. If the candidate peptide contains two Tyrosines, RAIdJDbS 



will replace one Tyrosine at a time with a Tryptophan and score 
both the new peptides. It is evident that the complexity grows fast if 
one were to allow for two novel SAPs per petpide. 

It is commonly believed that when searching in a larger data- 
base, one is bound to loose sensitivity. This may be true if the 
P-value for every hit is obtained by multiplying the peptide's P- 
value by the same number regardless of the category that peptide 
belongs to. As we have explained earlier, RAIdJDbS does not do 
that. It uses a method equivalent to Bonferroni correction. We use 
P-values to rank peptide hits and each peptide's P-value is obtai- 
ned by multiplying its P-value by the corresponding size of the 
effective database that the peptide belongs to. Consequently, pep- 
tide hits falling in a category that has a large effective database 
size essentially need to have smaller P-values than those of pep- 
tide hits falling in a category that has a small effective database 
size. In Figure 3, we show the Receiver Operating Characteristic 
(ROC) curves when analyzing the training 54 spectra of PEAKS. 
In this data set, there are 17 spectra from yeast, 23 spectra from 
bovine, and 14 spectra from horse. The true positive proteins are 
already provided by PEAKS. We search the spectra generated by 
proteins of yeast, bovine, and horse respectively in the databases 
fo yeast, bovine, and horse. Since the true positive proteins are 
already known, it is relatively easy to perform the ROC analysis 
using the search results from the 54 spectra. There are three ROC 
curves shown in Figure 3, one for searches without SAPs/PTMs, one 
for searches allowing annotated SAPs/PTMs, and one for searches 
allowing both annotated SAPs/PTMs as well as novel SAPs. 
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Fig. 3. ROC curves for three different search strategies employed when 
running RAId_DbS. 
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Summary of Organismal Databases Constructed 

So far, we have finished constructing databses for 17 organisms. We 
summarize these database in Table 2 below. Note that the disease 



information is only available for the human database. For human 
database, we have 123, 464 SAPs and 81, 984 PTMs. Out of those 
SAPs and PTMs, 15, 787 of them have disease associations. 
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Table 2. Summary of Augmented Organismal Databases Searchable by RAIcLDbS. 



Organism 


DB_name 


SAPs included 


PTMs included 


DB.size (byte) 


Homo sapiens 


hsa.seq 


123464 


81984 


16,292,193 


Anopheles gambiae 


angam.seq 


350 


50 


6,042,277 


Arabidopsis thaliana 


artha.seq 


5207 


11977 


12,318,213 


Bos taunts 


botau.seq 


3295 


15810 


11,188,490 


Caenorhabditis elegans 


caele.seq 


1045 


7756 


10,050,609 


Canis famUiaris 


cafam.seq 


2766 


4196 


18,458,474 


Danio rerio 


darer.seq 


7358 


3841 


14,477,794 


Drosophila melanogaster 


drmel.seq 


5611 


9290 


9,796,785 


Equus caballus 


eqcab.seq 


485 


1045 


9,404,150 


Gallus gallus 


gagal.seq 


1109 


6522 


8,728,501 


Macaca mulatto 


mamul.seq 


1370 


1262 


14,498,187 


Mus musculus 


mumus.seq 


27614 


61684 


14,363,491 


Oryza sativa 


orsat.seq 


1291 


2182 


10,679,924 


Pan troglodytes 


patro.seq 


5201 


3734 


20,227,873 


Plasmodium falciparum 


plfal.seq 


56 


184 


3,995,386 


Rattus norvegicus 


ranor.seq 


9297 


33240 


15,879,569 


Saccharomyces cerevisiae 


sacer.seq 


5507 


13220 


2,927,330 
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